Focussed crawling of environmental web resources: A pilot study on the combination of multimedia evidence

Authors

  • Theodora Tsikrika
  • Anastasia Moumtzidou
  • Stefanos Vrochidis
  • Yiannis Kompatsiaris
Abstract

This work investigates the use of focussed crawling techniques for the discovery of environmental multimedia Web resources that provide air quality measurements and forecasts. Focussed crawlers automatically navigate the hyperlinked structure of the Web and select the hyperlinks to follow by estimating their relevance to a given topic, based on evidence obtained from the already downloaded pages. Given that air quality measurements and particularly air quality forecasts are presented not only in textual form, but are most commonly encoded as multimedia, mainly in the form of heatmaps, we propose the combination of textual and visual evidence for predicting the benefit of fetching an unvisited Web resource. First, text classification is applied to select the relevant hyperlinks based on their anchor text, a surrounding text window, and URL terms. Further hyperlinks are selected by combining their text classification score with an image classification score that indicates the presence of heatmaps in their source page. A pilot evaluation indicates that the combination of textual and visual evidence results in improvements in the crawling precision over the use of textual features alone.
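The two-stage link selection described in the abstract could be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the `Link` class, the `select_links` function, and the threshold values `t_high`, `t_low`, and `t_img` are all hypothetical, and the abstract does not state how the two scores are actually combined. The sketch assumes each hyperlink already carries a text classification score and a heatmap score for its source page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Link:
    url: str
    # Hypothetical text classifier confidence, computed from the anchor text,
    # the surrounding text window, and the URL terms.
    text_score: float
    # Hypothetical image classifier confidence that the page containing
    # this hyperlink shows a heatmap.
    heatmap_score: float


def select_links(
    links: List[Link],
    t_high: float = 0.5,  # assumed threshold: text evidence alone suffices
    t_low: float = 0.2,   # assumed threshold: text evidence is borderline
    t_img: float = 0.5,   # assumed threshold: heatmap judged present
) -> List[str]:
    """Return the URLs to fetch, using text evidence first and visual evidence second."""
    selected = []
    for link in links:
        if link.text_score >= t_high:
            # Stage 1: the text classifier alone marks the link as relevant.
            selected.append(link.url)
        elif link.text_score >= t_low and link.heatmap_score >= t_img:
            # Stage 2: a borderline text score is accepted only when the
            # source page also appears to contain a heatmap.
            selected.append(link.url)
    return selected
```

Under these assumptions, a link with a weak text score is followed only when visual evidence supports it, which is one simple way the combination could raise crawling precision over text alone.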

Similar resources

Georeferencing Semi-Structured Place-Based Web Resources Using Machine Learning

In recent years, the content shared on the web has grown significantly. A large part of this information is publicly available in the form of semi-structured data. Moreover, a significant amount of this information is related to place. Such information refers to a location on the Earth but does not contain any explicit coordinates. In this research, we tried to georefer...

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, it is not a simple task to download domain-specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical...

Hybrid focused crawling on the Surface and the Dark Web

Focused crawlers enable the automatic discovery of Web resources about a given topic by automatically navigating through the Web link structure and selecting the hyperlinks to follow by estimating their relevance to the topic of interest. This work proposes a generic focused crawling framework for discovering resources on any given topic that reside on the Surface or the Dark Web. The proposed ...

Web anomaly detection through building access usage profiles

Due to the increase in cyber-attacks, the need for web server attack detection techniques has drawn attention. Unfortunately, many available security solutions are inefficient in identifying web-based attacks. The main aim of this study is to detect abnormal web navigation based on web usage profiles. In this paper, comparing scrolling behavior of a normal user with an attacker, and simu...

Eye-Tracking Method Usage for Understanding the Cognitive Processes in Multimedia Learning

Introduction: The design of multimedia learning environments should be based on evidence-based studies and principles of the human learning process. Eye tracking is a method based on the learner's processing of learning materials that are presented in multimedia learning environments. The aim of the study was to examine the use of the eye-tracking method to investigate the cognitive processes in m...

Publication date: 2014